

## **Pipelining**

#### E. Sanchez, M. Sonza Reorda Politecnico di Torino

Dipartimento di Automatica e Informatica (DAUIN)

Torino - Italy

This work is licensed under the Creative Commons (CC BY-SA) License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/



#### Introduction

- Pipelining is an implementation technique whereby multiple instructions are overlapped in execution
- In a pipeline, different units (called stages) are completing different parts of different instructions in parallel.



#### **Definitions**

- The throughput of a pipelined processor is the number of instructions which <u>exit</u> the pipeline in the time unit
- All the pipeline stages are synchronized (they proceed to executing a new task all together); the time for executing one step is called *machine* cycle, and normally corresponds to one clock cycle
- The length of the machine cycle is determined by the slowest stage
- CPI stands for clock cycles per instruction.

#### Ideal pipeline

- In an ideal pipeline, all the stages are perfectly balanced (i.e., they require the same time)
- The throughput of an ideal pipeline (i.e., the number of instructions completed in a given time period) is
  - throughput<sub>pipelined</sub> = throughput<sub>unpipelined</sub> \* n being n the number of pipeline stages.

# **Example processor: Unpipelined Implementation**

- The execution of each instruction may be composed of at most 5 clock cycles:
  - Instruction fetch cycle (IF)
  - Instruction decode/register fetch cycle (ID)
  - Execution/effective address cycle (EX)
  - Memory access/branch completion cycle (MEM)
  - Write-back cycle (WB).

#### Instruction Fetch cycle

$$IR \leftarrow Mem[PC]$$
 $NPC \leftarrow PC + 4$ 

## Instruction Decode/ Register Fetch

```
A ← Regs[IR6..10]
B ← Regs[IR11..15]
Imm ← (([IR]<sub>16</sub>)<sup>16</sup>##IR16..31)
```



## Instruction Decode/ Register Fetch

```
A \leftarrow Regs[IR6..10]
B \leftarrow Regs[IR11..15]
Imm \leftarrow (([IR]<sub>16</sub>)<sup>16</sup>##IR16..31)
```

Fixed-field decoding: allows for decoding to be performed while registers are read

## Instruction Decode/ Register Fetch

```
A \leftarrow \text{Regs}[IR6..10]
```

 $B \leftarrow Regs[IR11..15]$ 

Imm  $\leftarrow (([IR]_{16})^{16} \# IR16..31)$ 

Sign extension

#### **Execution/Effective Address Cycle**

Memory reference

```
ALUOutput \leftarrow A + Imm;
```

- Register-Register ALU instruction
   ALUOutput ← A op B;
- Register-Immediate ALU instruction
   ALUOutput ← A op Imm;
- Branch

```
ALUOutput \leftarrow NPC + Imm;
Cond \leftarrow (A op 0);
```

# Memory Access/Branch Completion Cycle

Memory reference

```
LMD \leftarrow Mem[ALUOutput] or Mem[ALUOutput] \leftarrow B;
```

Branch

```
if (cond) PC \leftarrow ALUOutput else PC \leftarrow NPC;
```

#### Write-back Cycle

- Register-Register ALU instruction
   Regs[IR16..20] ← ALUOutput;
- Register-Immediate ALU instruction
   Regs[IR11..15] ← ALUOutput;
- Load instruction
   Regs[IR11..15] ← LMD;



#### **Behavior and optimizations**

- All instructions require 5 clock cycles
- A simple control unit is required to produce the signals required by the datapath.

## Example processor: basic pipelined version

- A new instruction is started at each clock cycle
- Different resources work on different instructions at the same time
- At every clock cycle, each resource can be used for one purpose, only. This means that
  - Separate instruction and data memories (i.e., caches) must be used
  - The register file is used in two stages: for reading (second half of the cc) in ID and for writing (first half of the cc) in WB. It must be designed to satisfy these requests during the same clock cycle.
  - The PC must be changed in the IF stage. What about branches?
- Pipeline registers must be added between stages.

### Pipelined data path



#### **Evolution in time**



#### Pipeline performance

- Pipelining increases the processor throughput without making single instructions faster.
- Single instruction processing is made slower due to the pipeline control overheads.
- The depth of a pipeline is limited by
  - the need for balanced stages
  - pipelining overhead (pipeline register delay and clock skew).

- Consider the unpipelined processor, and suppose that
  - The clock cycle is 1 ns
  - ALU operations and branches require 4 cycles
  - Memory operations require 5 cycles
  - The relative frequency of these operations is 40% (ALU), 20% (branches), and 40% (mem).
- The average instruction execution time is

Clock cycle 
$$\times$$
 average CPI  
= 1 ns  $\times$  ((0.4 + 0.2)  $\times$  4 + 0.4  $\times$  5)  
= 1 ns  $\times$  4.4  
= 4.4 ns

#### **Example (II)**

- Suppose that moving to the pipelined architecture slows down the clock of the slowest stage by 20%.
- The average instruction execution time is therefore 1.2 ns.
- The speedup introduced by pipelining is speedup = 4.4 ns / 1.2 ns = 3.7 times

#### **Pipeline Hazards**

- Hazards are situations that prevent an instruction from executing during its designated clock cycle.
- There are three classes of hazards:
  - structural hazards: coming from resource conflicts
  - data hazards: an instruction depends on the result of a previous instruction
  - control hazards: depend on branches and other instructions that change the PC.

#### **Stalls**

- One way of dealing with hazards is to force the pipeline to stall, i.e., to block instructions for one or more clock cycles.
- When an instruction is stalled:
  - the instructions following the stalled instruction are also stalled
  - the instructions preceding the stalled instruction continue.
- A stall causes the introduction of a bubble in the pipeline.

| Instruction         | Clock cycle number |    |    |       |     |     |    |     |     |     |
|---------------------|--------------------|----|----|-------|-----|-----|----|-----|-----|-----|
|                     | 1                  | 2  | 3  | 4     | 5   | 6   | 7  | 8   | 9   | 10  |
| Load instruction    | IF                 | ID | EX | MEM   | WB  |     |    |     |     |     |
| Instruction $i + 1$ |                    | IF | ID | EX    | MEM | WB  |    |     |     |     |
| Instruction $i + 2$ |                    |    | IF | ID    | EX  | MEM | WB |     |     |     |
| Instruction $i + 3$ |                    |    |    | stall | IF  | ID  | EX | MEM | WB  |     |
| Instruction $i + 4$ |                    |    |    |       |     | IF  | ID | EX  | MEM | WB  |
| Instruction $i + 5$ |                    |    |    |       |     |     | IF | ID  | EX  | MEM |
| Instruction $i + 6$ |                    |    |    |       |     |     |    | IF  | ID  | EX  |

Suppose that only one access to memory can happen during each clock cycle: therefore, fetch of instruction i+3 can not be performed here and must be delayed.

|                     |    |    |    | Clock C | yr  |     |    |     |     |     |
|---------------------|----|----|----|---------|-----|-----|----|-----|-----|-----|
| Instruction         | 1  | 2  | 3  | 4       | 5   | 6   | 7  | 8   | 9   | 10  |
| Load instruction    | IF | ID | EX | MEM     | WB  |     |    |     |     |     |
| Instruction $i + 1$ |    | IF | ID | EX      | MEM | WB  |    |     |     |     |
| Instruction $i + 2$ |    |    | IF | ID      | EX  | MEM | WB |     |     |     |
| Instruction $i + 3$ |    |    |    | stall   | IF  | ID  | EX | MEM | WB  |     |
| Instruction $i + 4$ |    |    |    |         |     | IF  | ID | EX  | MEM | WB  |
| Instruction $i + 5$ |    |    |    |         |     |     | IF | ID  | EX  | MEM |
| Instruction $i + 6$ |    |    |    |         |     |     |    | IF  | ID  | EX  |

As a consequence of the stall, no instruction is completed at clock cycle #8.

| Instruction         | Clock cycle num |    |    |       |     |     |    |     |     |     |
|---------------------|-----------------|----|----|-------|-----|-----|----|-----|-----|-----|
|                     | 1               | 2  | 3  | 4     | 5   | 6   | 7  | 8   |     | 10  |
| Load instruction    | IF              | ID | EX | MEM   | WB  |     |    |     |     |     |
| Instruction $i + 1$ |                 | IF | ID | EX    | MEM | WB  |    |     |     |     |
| Instruction $i + 2$ |                 |    | IF | ID    | EX  | MEM | WB | ,   |     |     |
| Instruction $i + 3$ |                 |    |    | stall | IF  | ID  | EX | MEM | WB  |     |
| Instruction $i + 4$ |                 |    |    |       |     | IF  | ID | EX  | MEM | WB  |
| Instruction $i + 5$ |                 |    |    |       |     |     | IF | ID  | EX  | MEM |
| Instruction i + 6   |                 |    |    |       |     |     |    | IF  | ID  | EX  |

#### Structural hazards

 They may happen when some pipeline unit is not able to execute all the operations scheduled for a given cycle.

#### Examples:

- A given unit is not able to complete its task in one clock cycle
- The pipeline owns only one register-file write port, but there are cycles in which two register writes are required
- The pipeline refers to a single-port memory, and there are cycles in which different instructions would like to access to the memory together.

In this clock cycle the processor accesses twice to memory (one to data mem and the other to instr mem)

|                     | Clock cycle number |    |    |       |     |     |    |     |     |     |  |  |
|---------------------|--------------------|----|----|-------|-----|-----|----|-----|-----|-----|--|--|
| Instruction         | 1                  | 2  | 3  | 4     | 5   | 6   | 7  | 8   | 9   | 10  |  |  |
| Load instruction    | IF                 | ID | EX | MEM   | WB  |     |    |     |     |     |  |  |
| Instruction $i + 1$ |                    | IF | ID | EX    | MEM | WB  |    |     |     |     |  |  |
| Instruction $i + 2$ |                    |    | IF | ID    | EX  | MEM | WB |     |     |     |  |  |
| Instruction $i + 3$ |                    |    |    | stall | IF  | ID  | EX | MEM | WB  |     |  |  |
| Instruction $i + 4$ |                    |    |    |       |     | IF  | ID | EX  | MEM | WB  |  |  |
| Instruction $i + 5$ |                    |    |    |       |     |     | IF | ID  | EX  | MEM |  |  |
| Instruction $i + 6$ | _                  |    |    |       |     |     |    | IF  | ID  | EX  |  |  |

If the processor includes two caches, the stall can be avoided

|                     | Clock cycle number |    |    |       |     |     |    |     |     |     |  |
|---------------------|--------------------|----|----|-------|-----|-----|----|-----|-----|-----|--|
| Instruction         | 1                  | 2  | 3  | 4     | 5   | 6   | 7  | 8   | 9   | 10  |  |
| Load instruction    | IF                 | ID | EX | MEM   | WB  |     |    |     |     |     |  |
| Instruction $i + 1$ |                    | IF | ID | EX    | MEM | WB  |    |     |     |     |  |
| Instruction $i + 2$ |                    |    | IF | ID    | EX  | MEM | WB |     |     |     |  |
| Instruction $i + 3$ |                    |    |    | stall | IF  | ID  | EX | MEM | WB  |     |  |
| Instruction $i + 4$ |                    |    |    |       |     | IF  | ID | EX  | MEM | WB  |  |
| Instruction $i + 5$ |                    |    |    |       |     |     | IF | ID  | EX  | MEM |  |
| Instruction $i + 6$ |                    |    |    |       |     |     |    | IF  | ID  | EX  |  |

In this clock cycle the register file is accessed twice

|                          | Clock cycle numb |    |    |       |     |      |    |     |     |     |  |
|--------------------------|------------------|----|----|-------|-----|------|----|-----|-----|-----|--|
| Instruction              | 1                | 2  | 3  | 4     | 5   | 6    | 7  | 8   | 9   | 10  |  |
| Load instruction         | IF               | ID | EX | MEM   | WB  |      |    |     |     |     |  |
| Instruction $i + 1$      |                  | IF | ID | EX    | MEM | WB   |    |     |     |     |  |
| Instruction $i + 2$      |                  |    | IF | ID    | EX  | MEM  | WB |     |     |     |  |
| Instruction $i + 3$      |                  |    |    | stall | IF  | ID   | EX | MEM | WB  |     |  |
| Instruction $i + 4$      |                  |    |    |       |     | IF / | ID | EX  | MEM | WB  |  |
| Instruction $i + 5$      |                  |    |    |       |     |      | IF | ID  | EX  | MEM |  |
| Instruction <i>i</i> + 6 |                  |    |    |       |     |      |    | IF  | ID  | EX  |  |

If the register file is dual-port, the stall can be avoided

|                          | Clock cycle numb |    |    |       |     |      |    |     |     |     |  |
|--------------------------|------------------|----|----|-------|-----|------|----|-----|-----|-----|--|
| Instruction              | 1                | 2  | 3  | 4     | 5   | 6    | 7  | 8   | 9   | 10  |  |
| Load instruction         | IF               | ID | EX | MEM   | WB  |      |    |     |     |     |  |
| Instruction $i + 1$      |                  | IF | ID | EX    | MEM | WB   |    |     |     |     |  |
| Instruction $i + 2$      |                  |    | IF | ID    | EX  | MEM  | WB |     |     |     |  |
| Instruction $i + 3$      |                  |    |    | stall | IF  | ID   | EX | MEM | WB  |     |  |
| Instruction $i + 4$      |                  |    |    |       |     | IF / | ID | EX  | MEM | WB  |  |
| Instruction $i + 5$      |                  |    |    |       |     |      | IF | ID  | EX  | MEM |  |
| Instruction <i>i</i> + 6 |                  |    |    |       |     |      |    | IF  | ID  | EX  |  |

#### Removing Structural Hazards

- This requires adding new hardware or improving the existing one.
- The designer has to trade-off between performance and cost, basing on the frequency of occurrence of structural hazards.

- Load structural hazards happen when two instructions contemporarily try to make a memory access to a single-port memory.
- Assume that:
  - 40% of instructions make access to memory
  - the machine with structural hazard has a clock
     1.05 times faster than the one without.
- How much faster is the machine without structural hazard?

#### **Solution**

- For the machine without structural hazard:
  - Average Instruction Time = CPI × Clock Cycle Time
- For the machine with structural hazard:
  - Average Instruction Time = CPI × Clock Cycle Time

$$= (1 + 0.4 \times 1) \times \frac{\text{Clock Cycle Time}_{\text{ideal}}}{1.05}$$

$$= 1.3 \times \text{Clock Cycle Time}_{\text{ideal}}$$

#### **Data hazards**

- Overlapping the execution of instructions, as it is done by pipelining, changes the order of read/write accesses to operands.
- This can result in:
  - wrong results
  - undeterministic behavior.

Let consider the following code fragment:

ADD R1, R2, R3

SUB R4, R1, R5

AND R6, R1, R7

OR R8, R1, R9

XOR R10, R1, R11











### Interrupt effects

- If an interrupt occurs during the execution of a critical piece of code (from the point of view of data hazards) correctness may be lost or not, depending on the cases.
- This may cause an undeterministic behavior.

### Overcoming data hazards effects

- The wrong results produced by data hazards can be avoided:
  - by implementing a forwarding (or bypassing) technique
  - by stalling the instructions requiring the data until they are available.

### **Forwarding**

- A special hardware in the datapath detects when a previous ALU operation should write the register corresponding to the source of the current ALU operation.
- In this case, the hardware selects the ALU result as the ALU input rather than the value from the register file.
- The hardware must be able
  - to forward a data from any of the previously started instructions (provided that they didn't already write the data in its final location)
  - not to forward anything, if the following instruction is stalled, or an interrupt occurred.





# Generalizing the forward technique

In order to always avoid stalling, forwarding should be made possible between any pipeline register to any input of any functional unit.

#### **Example**

ADD R1, R2, R3

LD R4, 0(R1)

SD R4, 12(R1)

Forwarding must occur to ALU and data memory inputs.



### **Causes of Data Hazards**

- A hazard is created whenever there is dependence between instructions, and they are close enough that the overlap caused by pipelining would change the order of access to an operand.
- In general, this can happen for
  - register operands
  - memory operands: this is possible if
    - accesses to memory by load and store are not made in the same stage
    - execution can proceed while an instruction waits for a cache miss to be solved.

## **Data Hazards Requiring Stalls**

Not all potential data hazards can be solved through data forwarding.

#### **Example**

LD R1, 0(R2)

SUB R4, R1, R5

AND R6, R1, R7

OR R8, R1, R9

# Forwarding-based solution



### Solution with stall



### Implementing the control

- This requires that at each clock cycle:
  - All tests for detecting a possible data hazard concerning an instruction are performed when this is in the ID stage
  - If a data hazard is detected, two actions can be alternatively taken:
    - the appropriate forwarding is activated
    - the instruction is stalled before entering the stage where operands are not available (i.e., before being issued).

### **Load Interlock Detection**

| Situation Example code            |                          | •                                             | Action                                                                                                                                                    |  |  |  |
|-----------------------------------|--------------------------|-----------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|
| No dependence                     | LD<br>DADD<br>DSUB<br>OR | R1,45(R2)<br>R5,R6,R7<br>R8,R6,R7<br>R9,R6,R7 | No hazard possible because no dependence exists on R1 in the immediately following three instructions.                                                    |  |  |  |
| Dependence requiring stall        | LD<br>DADD<br>DSUB<br>OR | R1,45(R2)<br>R5,R1,R7<br>R8,R6,R7<br>R9,R6,R7 | Comparators detect the use of R1 in the DADD and stall the DADD (and DSUB and OR) before the DADD begins EX.                                              |  |  |  |
| Dependence overcome by forwarding | LD<br>DADD<br>DSUB<br>OR | R1,45(R2)<br>R5,R6,R7<br>R8,R1,R7<br>R9,R6,R7 | Comparators detect use of R1 in DSUB and forward result of load to ALU in time for DSUB to begin EX.                                                      |  |  |  |
| Dependence with accesses in order | LD<br>DADD<br>DSUB<br>OR | R1,45(R2)<br>R5,R6,R7<br>R8,R6,R7<br>R9,R1,R7 | No action required because the read of R1 by 0R occurs in the second half of the ID phase, while the write of the loaded data occurred in the first half. |  |  |  |

# Load Interlock Detection (cont'd)

 The following checks have to be performed when a Load instruction is in the EX stage

| Opcode field of ID/EX (ID/EX.IR <sub>05</sub> ) | Opcode field of IF/ID (IF/ID.IR <sub>05</sub> ) | Matching operand fields      |  |  |
|-------------------------------------------------|-------------------------------------------------|------------------------------|--|--|
| Load                                            | Register-register ALU                           | ID/EX.IR[rt] == IF/ID.IR[rs] |  |  |
| Load                                            | Register-register ALU                           | ID/EX.IR[rt] == IF/ID.IR[rt] |  |  |
| Load                                            | Load, store, ALU immediate, or branch           | ID/EX.IR[rt] == IF/ID.IR[rs] |  |  |

If operands match, a data hazard is detected and as a result, the control unit must insert a pipeline stall and prevent the instructions in IF and ID stages from advancing.

## Introducing a stall

- Introducing a stall in the EX stage can be done in one of the following ways:
  - forcing all 0s in the control portion of the ID/EX pipeline register (corresponding to a nop instruction)
  - forcing the IF/ID pipeline register to maintain the current value.

### **Forwarding Logic**

- Forwarding can be implemented
  - from the ALU or data memory output
  - to ALU inputs, data memory inputs, or the zero detection unit (branch instructions).
- The forwarding logic must compare:
  - the destination fields of the IR contained in the EX/MEM and MEM/WB registers with
  - the source fields of the IR contained in the IF/ID, ID/EX and EX/MEM registers.

# Forwarding to the ALU inputs

| Pipeline<br>register<br>containing<br>source<br>instruction | Opcode<br>of source<br>instruction | Pipeline<br>register<br>containing<br>destination<br>instruction | Opcode of<br>destination<br>instruction                         | Destination<br>of the<br>forwarded<br>result | Comparison<br>(if equal then<br>forward)               |  |
|-------------------------------------------------------------|------------------------------------|------------------------------------------------------------------|-----------------------------------------------------------------|----------------------------------------------|--------------------------------------------------------|--|
| EX/MEM Register-<br>register ALU                            |                                    | ID/EX                                                            | Register-register ALU,<br>ALU immediate, load,<br>store, branch | Top ALU<br>input                             | $EX/MEM.IR_{1620} = ID/EX.IR_{610}$                    |  |
| EX/MEM                                                      | Register-<br>register ALU          | ID/EX                                                            | Register-register ALU                                           | Bottom ALU<br>input                          | EX/MEM.IR <sub>1620</sub> = ID/EX.IR <sub>1115</sub>   |  |
| MEM/WB                                                      | Register-<br>register ALU          | ID/EX                                                            | Register-register ALU,<br>ALU immediate, load,<br>store, branch | Top ALU<br>input                             | MEM/WB.IR <sub>1620</sub> = ID/EX.IR <sub>610</sub>    |  |
| MEM/WB                                                      | Register-<br>register ALU          | ID/EX                                                            | Register-register ALU                                           | Bottom ALU<br>input                          | MEM/WB.IR <sub>1620</sub> = ID/EX.IR <sub>1115</sub>   |  |
| EX/MEM                                                      | ALU<br>immediate                   | ID/EX                                                            | Register-register ALU,<br>ALU immediate, load,<br>store, branch | Top ALU<br>input                             | EX/MEM.IR <sub>1115</sub> = ID/EX.IR <sub>610</sub>    |  |
| EX/MEM                                                      | ALU<br>immediate                   | ID/EX                                                            | Register-register ALU                                           | Bottom ALU<br>input                          | EX/MEM.IR <sub>1115</sub> = ID/EX.IR <sub>1115</sub>   |  |
| MEM/WB                                                      | ALU<br>immediate                   | ID/EX                                                            | Register-register ALU,<br>ALU immediate, load,<br>store, branch | Top ALU<br>input                             | MEM/WB.IR <sub>1115</sub> = ID/EX.IR <sub>610</sub>    |  |
| MEM/WB                                                      | ALU<br>immediate                   | ID/EX                                                            | Register-register ALU                                           | Bottom ALU<br>input                          | MEM/WB.IR <sub>1115</sub> = ID/EX.IR <sub>1115</sub>   |  |
| MEM/WB                                                      | Load                               | ID/EX                                                            | Register-register ALU,<br>ALU immediate, load,<br>store, branch | Top ALU<br>input                             | MEM/WB.IR <sub>1115</sub> =<br>ID/EX.IR <sub>610</sub> |  |
| MEM/WB                                                      | Load                               | ID/EX                                                            | Register-register ALU                                           | Bottom ALU<br>input                          | MEM/WB.IR <sub>1115</sub> = ID/EX.IR <sub>1115</sub>   |  |

# Hardware changes to support forwarding to ALU inputs



### **Control hazards**

- They are due to branches (conditional and unconditional), which may change the PC after the following instruction has been fetched already.
- In the case of <u>conditional</u> branches, the decision on whether the PC should be modified (branch taken) or not (branch untaken) can be taken even later.
- In the considered implementation, the PC is written with the target address (if the jump is taken) at the end of the ID stage.

### **Basic solution**

- A possible solution is based on stalling the pipeline as soon as a branch instruction is detected (ID stage) by:
  - decide earlier whether the branch has to be taken or not
  - compute earlier the new PC value.

# Basic pipelined data path



# Modified pipeline



## **Modified pipeline**



## **Modified pipeline**



| <b>Branch instruction</b> | IF | ID | EX | MEM | WB |     |     |
|---------------------------|----|----|----|-----|----|-----|-----|
| <b>Branch successor</b>   |    | IF | IF | ID  | EX | MEM | WB  |
| <b>Branch successor+1</b> |    |    |    | IF  | ID | EX  | MEM |
| <b>Branch successor+2</b> |    |    |    |     | IF | ID  | EX  |

Branch instruction IF ID
Branch successor
Branch successor+1
Branch successor+2

MEM WB
ID EX MEM WB
IF ID EX MEM
IF ID EX

In this clock period the fetch stage fetches the following instruction (as if the branch is not taken).

 $\mathbf{E}\mathbf{X}$ 

IF

IF

Branch instruction
Branch successor
Branch successor+1
Branch successor+2

ID EX MEM WB

IF ID EX MEM WB

IF ID EX MEM

IF ID EX

In this clock period the fetch stage fetches the right instruction (which depends on the branch result).

**Branch instruction** IF ID  $\mathbf{E}\mathbf{X}$ MEM WB **Branch successor** IF IF  $\mathbf{E}\mathbf{X}$ MEM WB ID IF ID EX **Branch successor+1 MEM Branch successor+2** IF ID EX

This operation is ALWAYS useless.

### Improved solutions

- There are several techniques for managing branches in a pipeline:
  - freezing the pipeline
  - predict untaken
  - predict taken
  - delayed branch.

## Freezing the pipeline

- It is the previously proposed solution: the pipeline is stalled (or flushed) as soon as a branch instruction is detected, and until the decision about the branch is known.
- It is the simplest solution to implement.

#### **Predict untaken**

- This technique
  - assumes the branch is not taken
  - avoid any change in the pipeline status until the branch decision has been taken
  - undo all the performed operations if the branch turns out to be taken.

## **Predict untaken**

| Untaken branch instruction | IF | ID | EX   | MEM  | WB   |      |     |     |    |
|----------------------------|----|----|------|------|------|------|-----|-----|----|
| Instruction $i + 1$        |    | IF | ID   | EX   | MEM  | WB   |     |     |    |
| Instruction $i + 2$        |    |    | IF   | ID   | EX   | MEM  | WB  |     |    |
| Instruction $i + 3$        |    |    |      | IF   | ID   | EX   | MEM | WB  |    |
| Instruction $i + 4$        |    |    |      |      | IF   | ID   | EX  | MEM | WB |
|                            |    |    |      |      |      |      |     |     |    |
| Taken branch instruction   | IF | ID | EX   | MEM  | WB   |      |     |     |    |
| Instruction $i + 1$        |    | IF | idle | idle | idle | idle |     |     |    |
| Branch target              |    |    | IF   | ID   | EX   | MEM  | WB  |     |    |
| Branch target + 1          |    |    |      | IF   | ID   | EX   | MEM | WB  |    |
| Branch target + 2          |    |    |      |      | IF   | ID   | EX  | MEM | WB |
|                            |    |    |      |      |      |      |     |     |    |

## **Predict untaken**

| Untaken                             | branch instruction  | IF   | ID   | EX   | MEM  | WB   |      |     |            |    |
|-------------------------------------|---------------------|------|------|------|------|------|------|-----|------------|----|
| Instructi                           | on <i>i</i> + 1     |      | IF   | ID   | EX   | MEM  | WB   |     |            |    |
| Instructi                           | on $i+2$            |      |      | IF   | ID   | EX   | MEM  | WB  |            |    |
| Instruction $i + 3$                 |                     |      |      |      | IF   | ID   | EX   | MEM | WB         |    |
| Instructi                           | on <i>i</i> + 4     |      |      |      |      | IF   | ID   | EX  | MEM        | WB |
| Taken branch instruction IF ID      |                     | ID   | EX   | MEM  | WB   |      |      |     |            |    |
| Instruction $i + 1$ IF              |                     |      | IF   | idle | idle | idle | idle |     |            |    |
| This result can also be obtained by |                     |      | also | / IF | ID   | EX   | MEM  | WB  |            |    |
|                                     |                     |      |      | r/   | IF   | ID   | EX   | MEM | WB         |    |
|                                     |                     |      |      |      |      | IF   | ID   | EX  | <b>MEM</b> | WB |
|                                     | turning the already |      |      |      |      |      |      |     |            |    |
| fetched instruction                 |                     |      |      |      |      |      |      |     |            |    |
|                                     | into a ı            | nop. |      | )    |      |      |      |     |            |    |

## **Predict untaken**



### **Predict taken**

 If the target address is known before the branch outcome, it may be possible to assume the branch as taken.

# Compiler role

If the hardware supports the predict taken or predict untaken scheme, the compiler can improve performance by generating code which maximizes the chance for the processor to make the right prediction.

### **Example**

Considering the loop implementation, the for scheme is suitable for the predict untaken scheme, the do while for the predict taken.

# **Delayed branch**

- This technique is based on filling the slot after the branch instruction (named branch-delay slot) with instructions which have to be executed no matter the branch outcome.
- It is up to the compiler to fill each branchdelay slot with the right instructions.
- The processor does nothing special when a branch instruction is decoded.

# **Example**



E. Sanchez

# Delayed-branch scheduling effectiveness

 It depends on the compiler ability in finding the right instructions to put in the delay slots.

 Using this technique, only about 30% of branches do produce a penalty.

### **Trend**

- With the advent of deeply pipelined processors, the delay slots are becoming longer, and the advantages of delayedbranches smaller.
- Therefore, several current RISC architectures do not support any more delayed branches.

### **EXCEPTIONS**

- Exceptions are events that modify the normal execution order of the program.
- Exceptional events (exceptions, interrupts, or faults) are more complex to handle in pipelined architectures because several instructions are being executed at a time.

## **Exception sources**

- Possible causes of exceptions are:
  - I/O device request
  - Operating system call by a user program
  - Tracing instruction execution
  - Breakpoint (programmer-requested interrupt)
  - Integer arithmetic overflow or underflow
  - FP arithmetic anomaly
  - Page fault
  - Misaligned memory accesses
  - Memory-protection violation
  - Undefined instruction
  - Hardware malfunction
  - Power failure.

- Synchronous vs. asynchronous
- User requested vs. coerced
- User maskable vs. user nonmaskable
- Within vs. between instructions
- Resume vs. terminate.

- Synchronous vs. asynchronous
- User requested vs. coerced
- User maskab

<u>or nonmaskahle</u>

- Within vs. β
- Resume vs

An exception is synchronous if it can be triggered always at the same position in the code. Asynchronous exceptions are normally generated by external devices.

- Synchronous vs. asynchronous
- User requested vs. coerced
- User maskable vs. user nonmaskable
- Within vs. Þ垂
- Resume

User requested exceptions are similar to procedures.

actructions

Coerced exceptions are out of the control of the user program.

- Synchronous vs. asynchronous
- User requested vs. coerced
- User maskable vs. user nonmaskable
- Within between instructions
- Resume vs.

The user can normally force the hardware not to answer to exception requests.

# **Exception CI**

- Synchronous vş
- User requested
- User maskable/
- Within vs. between instructions
- Resume vs. terminate.

Exceptions can occur either within or between instructions. In the former case they are generated by the instruction itself.

# **Exception Classifi**

- Synchronous vs. asynch
- User requested vs. co/
- User maskable vs
- Within vs. bet en instruction
- Resume vs. terminate.

After activating an exception, the program can either terminate, or execute something and then resume.

### Restartable machines

- Restartable machines are able to handle an exception, save the state, and restart without affecting the execution of the program.
- Nearly all processors nowadays are restartable machines.

# Stopping the execution

- When an exception occurs, the pipeline must execute the following steps:
  - force a trap instruction into the pipeline on the next IF stage
  - until the trap is taken, turn off all writes for the instruction that raised the exception (i.e., the faulty instruction) and for all the following instructions in the pipeline
  - when the exception-handling procedure receives control, it immediately saves the PC of the faulty instruction.

# Restarting the execution

 After the exception has been handled, special instructions return the machine from the exception by reloading the PC and restarting the instruction stream.

# **Precise exceptions**

- A processor has precise exceptions if the pipeline can be stopped so that
  - the instructions just before the instruction that triggered the exception are completed, and
  - the instructions following the instruction that triggered the exception can be restarted from scratch.
- Restarting after an exception may be really hard if exceptions are not handled precisely.
- Precise exception handling is a must for most architectures, at least for integer instructions.

## Cost of precise exceptions

- Some processors (Alpha 21064, MIPS R800) implement precise exceptions in debug mode, only.
- This mode is about 10 times slower than usual normal mode.

# MIPS: Possible sources of Exceptions

Pipeline stage Cause of exception

IF Page fault on instruction fetch

Misaligned memory access

Memory-protection violation

ID Undefined or illegal opcode

EX Arithmetic exception

MEM Page fault on data fetch

Misaligned memory access

Memory-protection violation

WB None

# **Contemporary exceptions**

Example

LD IF ID EX MEM WB

DADD IF ID EX MEM WB

A data page fault exception can occur in the MEM stage of the LD instruction, and an arithmetic exception in the EX stage of the DADD instruction.

The first exception is processed and, if its cause is removed, the second exception is handled.

# **Exception order**

There are cases in which two exceptions can occur in the opposite order of the instructions they relate to.



exception caused by instruction page fault o exceptions can second instruction occurs before an exception caused by a data bf the instructions page fault in the first instruction. Example (MEM)WB IF EX ID **MEMWB** DADD

# **Exception order (cont'd)**

- A possible solution is the following:
  - a status flag is associated to each instruction in the pipeline
  - if an instruction causes an exception, the status flag is set
  - if the status flag is set, the instruction can not perform any write operation
  - when an instruction reaches the last stage, and its status flag is set, an exception is triggered.

# Instruction set complications

- When an instruction is guaranteed to complete it is called *committed*.
- Some machines have instructions that change the state before they are committed (e.g., those using autoincrement addressing modes).
- If one of these instructions is aborted because of an exception, it leaves the machine state altered.
  - Implementing precise exceptions is thus very difficult.
- A possible solution is based on allowing to roll-back all the state changes made by an instruction before it is committed.

# Instruction set complications (cont'd)

- Instructions implicitly updating condition codes create complications:
  - they can cause data hazards
  - they require to be saved and restored in the case of an exception
  - they make more difficult the task of the compiler for filling possible delay slot between the instruction writing the condition codes and the branch one.

# Instruction set complications (cont'd)

- Complex instructions are difficult to implement in a pipeline. Forcing them to have the same length can hardly be achieved.
- Sometimes these problems are solved by pipelining the microinstructions implementing each instruction.

## References

J.L. Hennessy, D.A. Patterson

Computer Architecture: a Quantitative Approach

Morgan Kaufmann Publishers, Inc., V Edition, 2012

